Get started with building a model in this R Markdown document that accompanies Preprocess your data with recipes tidymodels start article.
If you ever get lost, you can click on the header of the knitted document to see the accompanying section in the online article.
Take advantage of RStudio IDE and use “Run All Chunks Above” or “Run Current Chunk” buttons to easily execute code chunks.
Load necessary packages:
library(tidymodels) # for the recipes package, along with the rest of tidymodels
# Helper packages
library(nycflights13) # for flight data
library(skimr) # for variable summaries
Load and wrangle data:
set.seed(123)
flight_data <-
flights %>%
mutate(
# Convert the arrival delay to a factor
arr_delay = ifelse(arr_delay >= 30, "late", "on_time"),
arr_delay = factor(arr_delay),
# We will use the date (not date-time) in the recipe below
date = as.Date(time_hour)
) %>%
# Include the weather data
inner_join(weather, by = c("origin", "time_hour")) %>%
# Only retain the specific columns we will use
select(dep_time, flight, origin, dest, air_time, distance,
carrier, date, arr_delay, time_hour) %>%
# Exclude missing data
na.omit() %>%
# For creating models, it is better to have qualitative columns
# encoded as factors (instead of character strings)
mutate_if(is.character, as.factor)
Check the number of delayed flights:
flight_data %>%
count(arr_delay) %>%
mutate(prop = n/sum(n))
## # A tibble: 2 x 3
## arr_delay n prop
## <fct> <int> <dbl>
## 1 late 52540 0.161
## 2 on_time 273279 0.839
Take a look at data types and data points:
glimpse(flight_data)
## Rows: 325,819
## Columns: 10
## $ dep_time <int> 517, 533, 542, 544, 554, 554, 555, 557, 557, 558, 558, 558,…
## $ flight <int> 1545, 1714, 1141, 725, 461, 1696, 507, 5708, 79, 301, 49, 7…
## $ origin <fct> EWR, LGA, JFK, JFK, LGA, EWR, EWR, LGA, JFK, LGA, JFK, JFK,…
## $ dest <fct> IAH, IAH, MIA, BQN, ATL, ORD, FLL, IAD, MCO, ORD, PBI, TPA,…
## $ air_time <dbl> 227, 227, 160, 183, 116, 150, 158, 53, 140, 138, 149, 158, …
## $ distance <dbl> 1400, 1416, 1089, 1576, 762, 719, 1065, 229, 944, 733, 1028…
## $ carrier <fct> UA, UA, AA, B6, DL, UA, B6, EV, B6, AA, B6, B6, UA, UA, AA,…
## $ date <date> 2013-01-01, 2013-01-01, 2013-01-01, 2013-01-01, 2013-01-01…
## $ arr_delay <fct> on_time, on_time, late, on_time, on_time, on_time, on_time,…
## $ time_hour <dttm> 2013-01-01 05:00:00, 2013-01-01 05:00:00, 2013-01-01 05:00…
Summarise the dataset:
flight_data %>%
skimr::skim(dest, carrier)
| Name | Piped data |
| Number of rows | 325819 |
| Number of columns | 10 |
| _______________________ | |
| Column type frequency: | |
| factor | 2 |
| ________________________ | |
| Group variables | None |
Variable type: factor
| skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
|---|---|---|---|---|---|
| dest | 0 | 1 | FALSE | 104 | ATL: 16771, ORD: 16507, LAX: 15942, BOS: 14948 |
| carrier | 0 | 1 | FALSE | 16 | UA: 57489, B6: 53715, EV: 50868, DL: 47465 |
Create training and test sets:
# Fix the random numbers by setting the seed
# This enables the analysis to be reproducible when random numbers are used
set.seed(555)
# Put 3/4 of the data into the training set
data_split <- initial_split(flight_data, prop = 3/4)
# Create data frames for the two sets:
train_data <- training(data_split)
test_data <- testing(data_split)
Try typing ?initial_split in the console to get more details about the splitting function from rsample package.
Let’s initiate a new recipe:
flights_rec <-
recipe(arr_delay ~ ., data = train_data)
You can see more details about how to create recipes by typing ?recipe in the console.
Update variable roles of a recipe with update_role:
flights_rec <-
recipe(arr_delay ~ ., data = train_data) %>%
update_role(flight, time_hour, new_role = "ID")
You can also read more about adding/updating/removing roles with ?roles.
To get the current set of variables and roles, use the summary() function:
summary(flights_rec)
## # A tibble: 10 x 4
## variable type role source
## <chr> <chr> <chr> <chr>
## 1 dep_time numeric predictor original
## 2 flight numeric ID original
## 3 origin nominal predictor original
## 4 dest nominal predictor original
## 5 air_time numeric predictor original
## 6 distance numeric predictor original
## 7 carrier nominal predictor original
## 8 date date predictor original
## 9 time_hour date ID original
## 10 arr_delay nominal outcome original
What happens if we transform date column to numeric?
flight_data %>%
distinct(date) %>%
mutate(numeric_date = as.numeric(date))
## # A tibble: 364 x 2
## date numeric_date
## <date> <dbl>
## 1 2013-01-01 15706
## 2 2013-01-02 15707
## 3 2013-01-03 15708
## 4 2013-01-04 15709
## 5 2013-01-05 15710
## # … with 359 more rows
From date we can derive more meaningful features such as:
Add steps to your recipe to generate these features:
flights_rec <-
recipe(arr_delay ~ ., data = train_data) %>%
update_role(flight, time_hour, new_role = "ID") %>%
step_date(date, features = c("dow", "month")) %>%
step_holiday(date, holidays = timeDate::listHolidays("US")) %>%
step_rm(date)
Check out help documents for these step functions with ?step_date, ?step_holiday, ?step_rm.
Create dummy variables using step_dummy():
flights_rec <-
recipe(arr_delay ~ ., data = train_data) %>%
update_role(flight, time_hour, new_role = "ID") %>%
step_date(date, features = c("dow", "month")) %>%
step_holiday(date, holidays = timeDate::listHolidays("US")) %>%
step_rm(date) %>%
step_dummy(all_nominal(), -all_outcomes())
Check if some destinations present in test set are not included in the training set:
test_data %>%
distinct(dest) %>%
anti_join(train_data)
## # A tibble: 1 x 1
## dest
## <fct>
## 1 LEX
Remove variables that contain only a single value with step_zv():
flights_rec <-
recipe(arr_delay ~ ., data = train_data) %>%
update_role(flight, time_hour, new_role = "ID") %>%
step_date(date, features = c("dow", "month")) %>%
step_holiday(date, holidays = timeDate::listHolidays("US")) %>%
step_rm(date) %>%
step_dummy(all_nominal(), -all_outcomes()) %>%
step_zv(all_predictors())
Recall the Build a model article.
This time we build a model specification for logistic regression using glm engine:
lr_mod <-
logistic_reg() %>%
set_engine("glm")
For more details try typing ?set_engine and ?glm in the console.
Bundle the model specification (lr_mod) with the recipe (flights_rec) to create a model workflow:
flights_wflow <-
workflow() %>%
add_model(lr_mod) %>%
add_recipe(flights_rec)
flights_wflow
## ══ Workflow ═════════════════════════════════════════════════════
## Preprocessor: Recipe
## Model: logistic_reg()
##
## ── Preprocessor ─────────────────────────────────────────────────
## 5 Recipe Steps
##
## ● step_date()
## ● step_holiday()
## ● step_rm()
## ● step_dummy()
## ● step_zv()
##
## ── Model ────────────────────────────────────────────────────────
## Logistic Regression Model Specification (classification)
##
## Computational engine: glm
Prepare the recipe and train the model:
flights_fit <-
flights_wflow %>%
fit(data = train_data)
Pull the fitted model object then use the broom::tidy() function to get a tidy tibble of model coefficients:
flights_fit %>%
pull_workflow_fit() %>%
tidy()
## # A tibble: 157 x 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 3.91 2.73 1.43 1.51e- 1
## 2 dep_time -0.00167 0.0000141 -118. 0.
## 3 air_time -0.0439 0.000561 -78.4 0.
## 4 distance 0.00686 0.00150 4.57 4.84e- 6
## 5 date_USChristmasDay 1.12 0.173 6.49 8.45e-11
## # … with 152 more rows
Simply apply fitted model to test_data and predict outcomes.
predict(flights_fit, test_data)
## # A tibble: 81,454 x 1
## .pred_class
## <fct>
## 1 on_time
## 2 on_time
## 3 on_time
## 4 on_time
## 5 on_time
## # … with 81,449 more rows
flights_pred <-
predict(flights_fit, test_data, type = "prob") %>%
bind_cols(test_data %>% select(arr_delay, time_hour, flight))
# The data look like:
flights_pred
## # A tibble: 81,454 x 5
## .pred_late .pred_on_time arr_delay time_hour flight
## <dbl> <dbl> <fct> <dttm> <int>
## 1 0.0565 0.944 on_time 2013-01-01 05:00:00 1714
## 2 0.0264 0.974 on_time 2013-01-01 06:00:00 79
## 3 0.0481 0.952 on_time 2013-01-01 06:00:00 301
## 4 0.0325 0.967 on_time 2013-01-01 06:00:00 49
## 5 0.0711 0.929 on_time 2013-01-01 06:00:00 1187
## # … with 81,449 more rows